Assigning Document Identifiers to Enhance Compressibility of Fulltext Indices

Authors

  • Salvatore Orlando
  • Raffaele Perego
  • Fabrizio Silvestri
Abstract

Index compression has long been a major issue in Information Retrieval systems. In particular, given the sheer scale of the data handled by Web Search Engines (WSEs), compressing the index is no longer an option but a necessity. The most important index compression methods are designed for Inverted File (IF) indexes. These methods assume that posting lists are stored as sequences of d-gaps (i.e., differences between successive document identifiers). Compression is then carried out with variable-length encoding methods, which represent smaller numbers using fewer bits. In this paper, instead of seeking a novel encoding method, we propose an algorithm that assigns identifiers to documents in a way that minimizes the average value of the d-gaps. Simulations performed on a real dataset, the Google contest collection, show that our approach yields an IF index that is, depending on the d-gap encoding chosen, up to 23% smaller than one built over randomly assigned document identifiers. Moreover, we show, both analytically and empirically, that the complexity of our algorithm is linear in space and time.
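To make the role of d-gaps concrete, the following is a minimal Python sketch of why the assignment of identifiers affects index size. It is not the authors' assignment algorithm, which the abstract does not describe. The helper names (`d_gaps`, `gamma_bits`, `encoded_size`), the two example posting lists, and the choice of Elias gamma as the variable-length code are all assumptions made for illustration.

```python
def d_gaps(postings):
    """Turn a sorted list of document identifiers into d-gaps.

    The first gap is measured from 0, so identifiers are assumed to start at 1.
    """
    gaps = []
    previous = 0
    for doc_id in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def gamma_bits(n):
    """Length in bits of the Elias gamma code for a positive integer n."""
    # Gamma code: (bit_length - 1) leading zeros, then the binary form of n.
    return 2 * (n.bit_length() - 1) + 1


def encoded_size(postings):
    """Total bits needed to gamma-encode the d-gaps of one posting list."""
    return sum(gamma_bits(gap) for gap in d_gaps(postings))


# Hypothetical posting lists for the same term under two identifier assignments.
scattered = [3, 40, 95, 170, 260]   # documents containing the term have distant IDs
clustered = [11, 12, 13, 14, 15]    # the same documents renumbered to adjacent IDs

print(d_gaps(scattered))            # [3, 37, 55, 75, 90]
print(d_gaps(clustered))            # [11, 1, 1, 1, 1]
print(encoded_size(scattered))      # 51
print(encoded_size(clustered))      # 11
```

In this toy case the posting list has the same length under both assignments, but the clustered identifiers produce smaller gaps and therefore a shorter encoding. This is the effect an identifier assignment strategy tries to exploit across the whole index.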


Similar resources

Full-Text and Structural XML Indexing on B+-Tree

XML query processing is one of the most active areas of database research. Although the main focus of past research has been the processing of structural XML queries, there are growing demands for a full-text search for XML documents. In this paper, we propose XICS (XML Indices for Content and Structural search), novel indices built on a B-tree, for the fast processing of queries that involve s...


Full-Text and Structural Indexing of XML Documents on B+-Tree

XML query processing is one of the most active areas of database research. Although the main focus of past research has been the processing of structural XML queries, there are growing demands for a fulltext search for XML documents. In this paper, we propose XICS (XML Indices for Content and Structural search), which aims at high-speed processing of both full-text and structural queries in XML...


A Higher Order Online Lyapunov-Based Emotional Learning for Rough-Neural Identifiers

To enhance the performance of rough-neural networks (R-NNs) in system identification, a new stable learning algorithm based on emotional learning is developed for them. This algorithm facilitates error convergence by increasing the memory depth of R-NNs. To this end, an emotional signal as a linear combination of identification error and its differences is used to achie...


Sorting Out the Document Identifier Assignment Problem

The compression of Inverted File indexes in Web Search Engines has received a lot of attention in recent years. Compressing the index not only reduces space occupancy but also improves the overall retrieval performance, since it allows better exploitation of the memory hierarchy. In this paper we empirically show that in the case of collections of Web Documents we can enhance ...


Identifier Namespaces in Mathematical Notation

In Computer Science, namespaces help to structure source code and organize it into hierarchies. Initially, the concept of namespaces did not exist for programming languages, and programmers had to manage the source code themselves to ensure there were no name conflicts. However, nowadays, namespaces are adopted by the majority of modern programming languages. The concept of namespaces is benefi...



Publication date: 2004